PATTERN PROCESSORS FOR LANGUAGE ABDUCTION
7.1. Abduction of regular structures.

In Chapter 6 we studied a network that could modify itself in order
to learn at least some of the patterns appearing in its environment.
This was done by modifying the coupling coefficients of the network,
i.e. by changing the image processor that the network represents.
The resulting pattern inference is of inductive type: with the aid
of observations of what happens in the environment the network will
incorporate more and more of the surrounding pattern structure.

We shall now turn to pattern processors that also carry out
inference about the structure, but in a way emphasizing the explicit
generation of plausible hypotheses. Of course the previous network
processor can also be said to generate hypotheses. Indeed, the
changes in the geometry (see section 6.6) can be interpreted as
giving greater credibility to some statements, or hypotheses,
about the image algebra env(Ω), while other statements are made to
appear less likely. This sort of hypothesis is implicit, however,
in contrast to what will be studied here.

Using a term coined by C.S. Peirce (see Peirce (1955), pp. 150-156),
we shall speak of abduction when the processor is applied to
images from an incompletely known pattern structure in order to
generate as output plausible hypotheses concerning the structure.
Perhaps one could also call it "plausible reasoning", adopting
Polya's terminology, but if so, only for the limited context that
we shall describe in the next section.

Denote the incoming images by $I_1, I_2, I_3, \ldots$, all from some
image algebra $\mathcal{T}$, where the images will be assumed to be
pure for the moment; the modifications needed when they are
deformed are far from trivial, but the discussion of this will be
postponed till later.

It should be pointed out that the output of the abduction
algorithm (or abduction machine) should be pattern structures, not
just individual images for each input image. In other words we
are not looking for a single image operator that realizes a
certain task well, such as image restoration. Instead we operate
one level of abstraction higher and the mapping is of the form

(1.1)  $\text{ABDUCTION}: \mathcal{T} \times \mathcal{T} \times \mathcal{T} \times \cdots \to \Pi$

where $\Pi$ stands for a collection $\{\mathcal{T}\}$ of pattern
structures, let us say image algebras. Of course $\Pi$ should not be
completely general; the choice of $\mathcal{T}$'s is limited by what
we know a priori about generators, bond relations, connection types,
etc. The other extreme, that $\Pi$ consists of just two or a few
possible $\mathcal{T}$'s, is also of limited interest, so that $\Pi$
will be assumed to contain a large or even infinite set of pattern
structures.

Concerning the number of factors in the Cartesian product on
the left hand side of (1.1) we shall assume that it is finite, but
also that the abduction algorithm should be sequential, in the sense
that when another input image arrives we should only have to do a
moderate amount of additional computing, over what we have already
computed, to get a new hypothesis.

More importantly, we shall look for pattern processors that
are robust, insensitive to errors, and natural. Robustness means
that even though the algorithm has been designed to work well for
a certain type of pattern structure, it will not break down completely
when exposed to a slightly different type. Its performance may
deteriorate but it should not cease functioning. Insensitivity to
errors means that occasional errors in the computing or in the
inputs should not have any serious lasting effects; their
influence should die out as more images are processed. The
concept "natural" is more difficult to make precise, and what seems
a natural algorithm to one researcher may appear artificial to
another. Anyway, we shall try to choose the algorithms in such a
way that it is conceivable that they may be implemented in real
world pattern processors. More about this later.

The  reason  why  we  emphasize  these  three  properties  as  well 
as  the  inductive,  rather  than  the  deductive,  type  of  logic  is  that 
we  are  looking  for  models  with  potential  applicability  to  processors 
occurring  in  the  real  world  even  though  they  may  be  unrealistic  in 
their  details  at  present,  see  Vol.  1,  p.266.  On  the  other  hand 
we  shall  pay  less  attention  to  other  desirable  properties,  such  as 
computational  efficiency,  speed  of  convergence,  etc. 

How would one go about the task of inferring the image
algebra $\mathcal{T} \in \Pi$ having been given the sequence of images
$I_1, I_2, I_3, \ldots$? If $\Pi$ is denumerable, so that we can write

(1.2)  $\Pi = \{\mathcal{T}_1, \mathcal{T}_2, \mathcal{T}_3, \ldots\}$

we could start testing as a hypothetical pattern structure the image
algebra $\mathcal{T}_1$ for $I_1$, and if $I_1 \in \mathcal{T}_1$
continue with $I_2, I_3, \ldots$. As soon as we reach an image for
which $I_k \notin \mathcal{T}_1$ we go to the next image algebra
$\mathcal{T}_2$ and start again testing for $I_1 \in \mathcal{T}_2$,
$I_2 \in \mathcal{T}_2$ and so on. We stop when we reach an image
algebra $\mathcal{T}^* \in \Pi$ for which no test fails.
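
The trial-and-error scheme just described can be sketched in a few
lines of Python. This is only an illustration under assumptions that
are not made in the text: each candidate image algebra is represented
by a hypothetical membership predicate, the collection is a short
finite list, images are plain strings, and the name
`abduce_by_enumeration` is ours.

```python
from typing import Callable, Iterable, List, Optional

# A hypothesis is modelled as a membership test: image -> bool.
# (Hypothetical representation; the text leaves it unspecified.)
Hypothesis = Callable[[str], bool]


def abduce_by_enumeration(hypotheses: List[Hypothesis],
                          images: Iterable[str]) -> Optional[int]:
    """Return the index of the first hypothesis that contains every image."""
    seen: List[str] = []
    current = 0
    for image in images:
        seen.append(image)
        # Move on until the current hypothesis passes all tests so far;
        # each new hypothesis is re-tested from the first image onwards.
        while current < len(hypotheses) and not all(
                hypotheses[current](i) for i in seen):
            current += 1
        if current == len(hypotheses):
            return None  # no hypothesis in the finite list survives
    return current


# Toy collection: each "image algebra" is a set of strings over {a, b}.
PI: List[Hypothesis] = [
    lambda s: set(s) <= {"a"},        # strings consisting of a's only
    lambda s: len(s) % 2 == 0,        # strings of even length
    lambda s: set(s) <= {"a", "b"},   # all strings over {a, b}
]

print(abduce_by_enumeration(PI, ["aa", "ab", "aba"]))
```

Note that the order of the hypotheses is fixed in advance and does not
depend on the observed images, which is exactly the feature criticized
in the discussion of this scheme.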

To be able to guarantee that this leads to the correct
hypothesis $\mathcal{T}$ in a finite number of steps we need in general
access to an unlimited (potentially infinite) sequence of images
as well as some guarantee that the sequence is "representative"
for the whole image algebra $\mathcal{T}$. In the following we shall
let the sequence be generated by a simple random source according to
some probability measure $P$ over $\mathcal{T}$. We must then of course
require, in general, that the support of $P$ is the entire
$\mathcal{T}$.

We could then try to prove theorems of convergence, saying
that the algorithm will converge to $\mathcal{T}$ in a finite number
of steps with probability one. Such results can be obtained, assuming
for example that each image algebra in $\Pi$ is denumerable, but we
shall not pursue this line of thought.

Indeed, such an algorithm can scarcely be said to be natural:
it amounts to no more than trial and error. Furthermore it does
not represent abduction as described above, since the successive
hypotheses have not been generated to be plausible with respect to
the observed sequence; they are just given by a fixed sequence.

In order to select plausible hypotheses let us rephrase the
problem in accordance with current statistical doctrine. Given a
finite section $I_1, I_2, \ldots, I_n$ of the sequence of images, think
of the true image algebra $\mathcal{T}$ as an unknown "parameter" to be
estimated. We can then ask for the "best" estimator according to a
specified criterion and apply statistical estimation techniques. Our
"parameter" $\mathcal{T}$ can of course be something rather different
from and more complicated than what is the case in standard statistical
estimation theory, but we may still be able to solve such an
estimation problem successfully.

The resulting estimator can then be said to be plausible
since it utilizes the given information as well as possible to
select the hypothesis. The naturalness of such algorithms may be
less convincing, though. Such estimation techniques may be based
on the method of maximum likelihood, on Bayesian ideas, on the
principle of least squares etc., and in order to carry out the
maximization they may employ numerical schemes like Newton's
method, or carry out matrix inversion, or use search strategies to
find a maximum. The idea that human intelligence, in particular
language learning, would be based on, say, the ability of the human
mind to do matrix inversions is less than appealing. We
have to look elsewhere for more natural abduction algorithms.

When the images in the sequence are encountered one could
attempt inference of $\mathcal{T}$ if the bond structure and bond
relations could be observed, since this would give information, at
least partial, about the connection type $\Sigma$ and bond relation
$\rho$. Then the combinatory rules $\langle \Sigma, \rho \rangle$ could
be inferred to some degree.

Unfortunately this is seldom possible since the internal bonds
are usually not directly observable; Chapter 3 of Volume 1 contains
many examples of this. To avoid this difficulty let us assume that
the learner has access to a teacher in addition to the pure image
sequence. The teacher's role is to help the learner by telling him
whether other images belong to $\mathcal{T}$ or not. Then we could base
the abduction on intentional deformations of the $I_k$, by applying
one or several deformations to each $I_k$ and asking the teacher
whether the deformed image is still within $\mathcal{T}$. The answers
will contain information about the internal bonds although it will be
less complete than would be the case if the whole configuration
were available for observation. Summing up, the problem has now
been reformulated as follows.

Case 7.1:1 (inference by deformations). Given a sequence
$I_1, I_2, I_3, \ldots$ of pure images from $\mathcal{T}$ and a
deformation mechanism $\mathcal{D}$, use the knowledge of whether
$\mathcal{D} I_k \in \mathcal{T}$ or not to generate
hypotheses about the pattern structure.

This is still too general to be of any real help since it
does not say how to choose $\mathcal{D}$. We have seen in Volume 1
that there is a rich variety of deformation mechanisms and we now have
to narrow down the choice.

Recalling that we are looking for information about
$\langle \Sigma, \rho \rangle$, it seems natural to use a $\mathcal{D}$
that leaves much of the image unchanged and only modifies a sub-image
involving only one or a few bonds. In case $\mathcal{D} I_k$ is not in
$\mathcal{T}$ we should concentrate our attention on the sub-image
mentioned and its connection to the rest of $I_k$. More precisely:

Case 7.1:2 (deforming sub-images). $\mathcal{D}$ is said to deform
sub-images if, for any $I \in \mathcal{T}$, the image is broken up into

(1.3)  I = ^1 

and

Note that this leaves the external bond structure
as before but not always the external bond values. The
reader may notice that we have encountered some related versions
of Case 7.1:2 before. Indeed, a jittering deformation (see
section 4.2 in Volume 1) restricted to $s_i = e$ for
$i = m, m+1, \ldots, n$ deforms only the sub-image containing the
generators $g_1, g_2, \ldots, g_{m-1}$, where $m < n$. The missing
generator deformations (see Case 2.9, Volume 1) also belong to this
type: they annihilate a certain sub-image.

In order that we learn something from observing the
regularity of $\mathcal{D} I_k$ we must ask that the deformed images
are occasionally outside $\mathcal{T}$, i.e. $\mathcal{D}$ should be
heteromorphic. Otherwise, for automorphic deformations, we could not
hope to find the true limitations upon the combinatory pattern
structure. For example, a shift deformation (see equation (2.1) in
section 4.2 in Volume 1) would not be powerful enough. On the other hand
Case 2.H in Volume 1 is a good example of a deformation that would
seem a promising candidate for the abduction algorithm.

Another way of looking at the image relations in (1.3) and
(1.4) is in terms of congruence (see Volume 1, p. 31). If the
sub-image that is replaced is congruent with its replacement then the
deformed image will of course be regular, i.e. in $\mathcal{T}$. But
if the two are not congruent then the deformed image may be irregular,
not always, but for some choices of the unchanged part of the image.
Therefore the teacher's answer to the question whether
$\mathcal{D} I \in \mathcal{T}$ will tell us something about the
congruence relation and hence about the pattern structure. Due to the
"not always" we cannot claim with certainty to have arrived at a
correct hypothesis; we are doing induction and not deduction.


Let us pursue this line of thought only a bit further here.
Suppose that we tried jittering deformations of sub-images. We
would then run into the difficulty that the similarity group $S$ may
not be completely known to us a priori. To make sure that the
deformations are drastic enough we should look for a set $\hat{S}$ of
mappings containing $S$, $S \subset \hat{S}$, but probably larger since
shifts may not be sufficient. We would then select elements of
$\hat{S}$, apply them to a sub-image and observe the regularity or
irregularity of the deformed image. We shall return to this in more
detail in the next section.

After we have constructed an abduction algorithm the
question arises how to implement it by physical devices. Especially
network processors like those in Chapter 6 present themselves as
natural candidates for functioning as abduction machines. This
will be attempted in section 7.3.


7.2. Abduction of some language patterns.


Based on the general considerations of the last section we
shall now attempt to construct abduction algorithms when the
patterns come from some formal language. This will give a concrete
illustration of how inference by deformations can be obtained
using Case 7.1:2.

Before deciding what type of formal language we shall use,
let us consider briefly the flow of information when we attempt
abduction. The speaker lives in an environment characterized
by some image algebra env(Ω) in Figure 7.2.1. A given image
$I \in$ env(Ω) can give rise to many different sentences belonging to
a language L(Gr) described by a grammar Gr. This means that an
image processor maps the image algebra env(Ω) into another image
algebra, so that the image operator "transducer" in the figure
takes microworld images into language images.

As an example of how such an image operator may work the
reader is referred to section 2.4. Of course the mapping will be
one-to-many, since a given image $I$ in env(Ω) can give rise to
many syntactically correct sentences, following Gr, and agreeing
with $I$ semantically.

The sentence from L(Gr) is then subjected to the deformation
mechanism $\mathcal{D}$, the second image operator in the figure, that
changes a sub-image. The result is presented to the teacher and is also
stored in the short term memory of P together with the instruction
from the teacher and the undeformed sentence. Say that the
teacher only answers yes or no, according to the grammaticality of
the sentence. The grammaticality function will be denoted gr(·)
and is of course not known to the learner a priori. It takes the values
YES and NO. The abduction algorithm processes these three inputs
and saves the result in some form, yet to be specified, in the
long term memory, after which the short term memory is cleared.
As more and more sentences are processed the algorithm is expected
to converge to a limiting grammar weakly equivalent to Gr, and
with performance parameters that characterize the
probability distribution over L(Gr).

In this chapter we shall study abduction of the language
L(Gr) when no semantic input is available, so that in Figure 7.2.1
env(Ω) is not connected to memory by the dotted lines. The
situation with semantic input from the image algebra env(Ω) is a
challenging research problem which will not, however, be studied
here.

In the figure we have two image operators: the transducer
with sentences from L(Gr) as output and $\mathcal{D}$, the deformation
mechanism that was discussed briefly in the last section, whose
outputs are strings over the same terminal vocabulary as used in
its inputs. The specification of the abduction algorithm should
define the latter one in detail and we now address ourselves to
this question.

First we must decide what type of grammars to use here.
A simple but not trivial type is the finite state grammar and this
is what we shall use for Gr; see Notes for further discussion as
well as for historical remarks.


We use the same assumptions and notation as in Volume 1,
sections 2.4, 2.10, and 3.2. The terminal vocabulary $V_T$ contains
$n_T$ words denoted generically as $x, y, \ldots$ or these letters
subscripted as needed. The syntactic variables, the non-terminals,
form a set $V_N$ with $n_N$ elements, denoted by $i, j, \ldots$ or
these letters subscripted as needed. The rewriting rules are of the
form

(2.1)  $i \to xj; \quad x \in V_T; \quad i, j \in V_N$
       $i \to x; \quad x \in V_T; \quad i \in V_N$

the latter type resulting in termination of the derivation. The number
of rewriting rules is denoted $n_R$. The corresponding probabilities
are denoted $p_{ij}(x)$ and $r_i(x)$, forming a matrix $P(x)$ and a
vector $r(x)$ respectively. Also

(2.2)  $P = \sum_{x \in V_T} P(x), \qquad r = \sum_{x \in V_T} r(x)$

With the usual assumptions the consistency question of the
syntax-controlled probability model is automatically answered in the
affirmative, see Theorem 10.7 of Chapter 2, Volume 1, p. 90. Hence
there exists a well-defined probability measure over the regular
configurations. Recall that regular configurations here means linear
strings over $V_T$, and that bonds take values in $V_N$. Of course the
internal bonds have to satisfy the bond relation $\rho$ = EQUAL. The
corresponding images are "phrases" with one in-bond and one out-bond.

A generator is a rewriting rule, as in (2.1), so that it can
be regarded as an element from $V_T$, a word, together with its in-
and out-bonds, $(i,j)$ or $(i,F)$ where $F$ represents the final state.
The initial state will be chosen as $i = 1$.

It is tempting to think of a generator as just a word from
$V_T$. This is not correct since it can very well happen that one
and the same word $x$ appears in two different rewriting rules
$i \to xj$ and $i' \to xj'$. Therefore the mapping $V_T \to G$ can be
one-to-many, which will have important consequences later on.

Let $x$ and $y$ be fixed in $V_T$ and pick two arbitrary strings
$u, v \in V_T^*$. If it is always true that the concatenated strings
$uxv$ and $uyv$ are grammatical or ungrammatical together we say that
$x \equiv y$, they are equivalent (or congruent). In other words

(2.3)  $x \equiv y \iff gr(uxv) = gr(uyv), \quad \forall u, v \in V_T^*$.

This equivalence partitions $V_T$ into equivalence classes. Note
that equivalence in (2.3) demands that the right hand side hold
for all $u, v$. Hence an infinite number of tests would be required
since $V_T^*$ is infinite. Testing it for a single case $(u,v)$ or a
finite number of cases does not suffice. Nevertheless one feels
that if the relation holds for many $(u,v)$-combinations then $x$
and $y$ are likely to be equivalent, in some sense that has not yet
been made precise.
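
The gap between a finite number of tests and the full condition (2.3)
can be made concrete with a small sketch. The toy grammar, the
dictionary `RULES` and the helper `agree_on` below are hypothetical
and serve only as an illustration; the function `gr` plays the role of
the grammaticality function of the text, returning True for YES.

```python
from typing import Dict, List, Sequence, Tuple

FINAL = "F"
# Hypothetical toy grammar: (state, word) -> next state; state 1 is initial.
RULES: Dict[Tuple[object, str], object] = {
    (1, "a"): 2,
    (2, "man"): 3,
    (2, "dog"): 4,
    (3, "sleeps"): FINAL,
    (3, "sings"): FINAL,
    (4, "sleeps"): FINAL,
}


def gr(sentence: Sequence[str]) -> bool:
    """Grammaticality: True iff the string leads from state 1 to FINAL."""
    state: object = 1
    for word in sentence:
        if (state, word) not in RULES:
            return False
        state = RULES[(state, word)]
    return state == FINAL


Context = Tuple[List[str], List[str]]


def agree_on(x: str, y: str, contexts: Sequence[Context]) -> bool:
    """Check gr(uxv) == gr(uyv) for the finitely many contexts (u, v) given."""
    return all(gr(u + [x] + v) == gr(u + [y] + v) for u, v in contexts)


few: List[Context] = [(["a"], ["sleeps"]), ([], ["sleeps"])]
more: List[Context] = few + [(["a"], ["sings"])]
print(agree_on("man", "dog", few))    # the two words look equivalent
print(agree_on("man", "dog", more))   # one further context separates them
```

With only the first two contexts the words "man" and "dog" pass every
test, although they are not equivalent in the sense of (2.3).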

We shall need the following simple statement.

Lemma 2.1. In order that $x \equiv y$ it is necessary and sufficient
that for any generator of the form $i \to xj$ there exist one of the
form $i \to yj$ and vice versa.

Proof: Consider two words $x$ and $y$, and strings $uxv$ and $uyv$. If
for any generator $i \to xj$ there is one $i \to yj$, and vice versa,
it is clear that $gr(uxv) = gr(uyv)$, and since this holds for any
$u, v \in V_T^*$ it follows that $x \equiv y$.

On the other hand, let us choose $x$ and $y$ such that $x \equiv y$.
If $uxv$ is grammatical and $u$ takes the initial state into the $i$th
state, $x$ takes $i$ into $j$, and $v$ takes $j$ into $F$, then, in
order that $uyv$ also be grammatical it is necessary that the internal
bonds fit. Hence there must be a rewriting rule of the form $i \to yj$
and the proof is complete.
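
When the generators are known, the criterion of Lemma 2.1 reduces the
equivalence question to a comparison of bond pairs. The following
sketch groups words by the set of pairs $(i, j)$ in which they occur.
The generator list and the names `bond_signature` and `word_classes`
are hypothetical and chosen for illustration only; the learner of the
text does not, of course, have access to the generators.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy generators (i, x, j): rule i -> x j, "F" the final state.
GENERATORS: List[Tuple[object, str, object]] = [
    (1, "a", 2), (1, "the", 2),
    (2, "man", 3), (2, "boy", 3), (2, "dog", 4),
    (3, "sleeps", "F"), (3, "sings", "F"), (4, "sleeps", "F"),
]


def bond_signature(word: str) -> FrozenSet[Tuple[object, object]]:
    """All (in-bond, out-bond) pairs with which the word occurs."""
    return frozenset((i, j) for i, x, j in GENERATORS if x == word)


def word_classes() -> List[List[str]]:
    """Group words with identical bond signatures (criterion of Lemma 2.1)."""
    classes: Dict[FrozenSet[Tuple[object, object]], Set[str]] = defaultdict(set)
    for _, x, _ in GENERATORS:
        classes[bond_signature(x)].add(x)
    return sorted(sorted(c) for c in classes.values())


print(word_classes())
```

In this toy example "sleeps" and "sings" share the pair (3, F) but fall
into different classes, since "sleeps" also occurs with the pair (4, F).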

To emphasize the equivalence property we choose as our similarity
transformations the set of all those permutations of generators that
leave the equivalence unchanged, so that if $g = i \to xj$ then
$sg = i \to x'j$ with $x \equiv x'$ for some $x'$. This determines the
generator classes $G^{\alpha}$ invariant with respect to $S$.

Of course $S$ is unknown ab initio and should be learnt during
the abduction process. We will start with some set $\hat{S}$ of
transformations of the images, where $\hat{S}$ need not be a group. As
our first task, to be carried out in sections 7.3-7.4, we shall take
the determination of the (unknown) word classes.

To make the following as concrete as possible we shall use
a "test grammar". Of course we could have chosen one using a
vocabulary consisting of abstract symbols since we are not
concerned with natural language processing here. For didactic
reasons, however, we have instead selected one generating
English-like strings so that the output is easier to read. Since the
semantic background has been left out it will be necessary to
"fudge" the grammar to prevent completely meaningless sentences from
being grammatical.

The grammar has $n_T = 52$ words, including the punctuation mark,
and a list of the terminal vocabulary is given in Table 2.1.
These 52 "words" are arranged in 23 word classes denoted by, for
example, DET for determiners, NH for human nouns and so on.

Table 2.1

No.   Class   Code   Word
  1     1     DET    *A
  2     1     DET    THE
  3     1     DET    SOME
  4     2     AJH    *TALL
  5     2     AJH    CLEVER
  6     2     AJH    SHORT
  7     2     AJH    YOUNG
  8     3     AJA    *SPOTTED
  9     3     AJA    FRISKY
 10     4     AJ1N   *FINE
 11     4     AJ1N   NEW
 12     4     AJ1N   VALUABLE
 13     5     AJ2N   *BLUE
 14     5     AJ2N   ORANGE
 15     5     AJ2N   GREEN
 16     6     NH     *MAN
 17     6     NH     BOY
 18     6     NH     WOMAN
 19     6     NH     GIRL
 20     7     NA     *CAT
 21     7     NA     KITTEN
 22     7     NA     DOG
 23     7     NA     PUPPY
 24     8     NN     *TABLE
 25     8     NN     CHAIR
 26     8     NN     DESK
 27     9     AUX    *IS
 28     9     AUX    WAS
 29    10     VP     *SEEN
 30    10     VP     HURT
 31    10     VP     HELPED
 32    11     VT     *LIKES
 33    11     VT     DISLIKES
 34    12     VI     *SPEAKS
 35    12     VI     SINGS
 36    13     CONJ   *AND
 37    13     CONJ   WHILE
 38    14     PRH    *HE
 39    14     PRH    SHE
 40    15     PRN    *IT
 41    16     PNH    *MARY
 42    16     PNH    JOHN
 43    17     PNA    *TOUKA
 44    17     PNA    ROVER
 45    18     ADV    *IMMENSELY
 46    18     ADV    VIOLENTLY
 47    19     VC     *SAYS
 48    19     VC     CLAIMS
 49    20     REL    *THAT
 50    21     BY     *BY
 51    22     NOT    *NOT
 52    23     DOT    *.

The test grammar Gr has 19 states including the final one
$F = 19$ as in Figure 2.1. This corresponds to the generators listed
in Table 2.2, where for example $1 \to$ PNA,7 really represents two
rewriting rules since the word class PNA contains two words.
In all we have 87 rewriting rules.

A program  generates  sentences  from  L(Gr)  as  described  in 
section  3.2  of  Volume  1.  The  performance  will  of  course  depend 
upon  the  probabilities  associated  with  the  generators.  Many  of 
the  sentences  are  quite  reasonable,  such  as  HE  IS  HELPED  BY  A BOY, 
or  JOHN  SPEAKS,  or  THE  DESK  IS  NOT  BLUE.  Some  are  a bit  doubtful, 
such  as  JOHN  CLAIMS  THAT  JOHN  SINGS,  or  SOME  VALUABLE  TABLE  IS 
NOT  GREEN.  More  seldom  one  gets  a very  strange  sentence,  for 
example,  HE  CLAIMS  THAT  THE  MAN  CLAIMS  THAT  A WOMAN  IS  HELPED  BY 
THE  DOG,  or  SHE  VIOLENTLY  LIKES  THE  BOY  WHILE  HE  SPEAKS.  It  may 
also  be  mentioned  that  some  perfectly  reasonable  looking  English 
sentences  over  the  given  terminal  vocabulary  are  not  accepted  by 
Gr,  for  example  ROVER  LIKES  THE  GIRL.  For  our  purpose  the  grammar 
represents  a sufficiently  difficult  task  however. 
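
The behaviour of the test grammar can be explored with a short program.
The sketch below encodes the word classes of Table 2.1 and the
generators of Table 2.2; it is not the program referred to in the text.
Since the rule probabilities are not listed here, the sketch assumes,
purely for illustration, a uniform choice among the rewriting rules
leaving each state. The names `CLASSES`, `TABLE_2_2`, `generate` and
`gr` are ours, and the punctuation mark is treated as the final word
"." of every sentence.

```python
import random
from typing import List, Tuple

CLASSES = {
    "DET": "A THE SOME", "AJH": "TALL CLEVER SHORT YOUNG",
    "AJA": "SPOTTED FRISKY", "AJ1N": "FINE NEW VALUABLE",
    "AJ2N": "BLUE ORANGE GREEN", "NH": "MAN BOY WOMAN GIRL",
    "NA": "CAT KITTEN DOG PUPPY", "NN": "TABLE CHAIR DESK",
    "AUX": "IS WAS", "VP": "SEEN HURT HELPED", "VT": "LIKES DISLIKES",
    "VI": "SPEAKS SINGS", "CONJ": "AND WHILE", "PRH": "HE SHE",
    "PRN": "IT", "PNH": "MARY JOHN", "PNA": "TOUKA ROVER",
    "ADV": "IMMENSELY VIOLENTLY", "VC": "SAYS CLAIMS", "REL": "THAT",
    "BY": "BY", "NOT": "NOT", "DOT": ".",
}
F = 19  # final state
# Table 2.2 as "in-state class out-state" triples.
TABLE_2_2 = """1 PRN 8, 1 PNA 7, 1 DET 2, 1 PRH 6, 1 PNH 6, 2 AJH 3, 2 AJA 4,
2 AJ1N 5, 2 NH 6, 2 NA 7, 2 NN 8, 3 NH 6, 4 NA 7, 5 NN 8, 6 VT 9, 6 VI 12,
6 AUX 13, 6 ADV 17, 6 VC 18, 7 AUX 13, 8 AUX 10, 9 DET 11, 10 AJ2N 12,
10 NOT 16, 11 NH 12, 11 NA 12, 12 CONJ 1, 12 DOT 19, 13 VP 14, 13 NOT 15,
14 BY 9, 15 VP 14, 16 AJ2N 12, 17 VT 9, 18 REL 1"""

# Expand the class-level table into word-level rewriting rules i -> x j.
RULES: List[Tuple[int, str, int]] = []
for entry in TABLE_2_2.split(","):
    i, cls, j = entry.split()
    RULES += [(int(i), x, int(j)) for x in CLASSES[cls].split()]


def generate(rng: random.Random) -> List[str]:
    """Sample a sentence; uniform choice among applicable rules (assumed)."""
    state, sentence = 1, []
    while state != F:
        word, state = rng.choice([(x, j) for i, x, j in RULES if i == state])
        sentence.append(word)
    return sentence


def gr(sentence: List[str]) -> bool:
    """Grammaticality: True iff the words lead from state 1 to F."""
    states = {1}
    for word in sentence:
        states = {j for i, x, j in RULES if i in states and x == word}
    return F in states


print(len(RULES))                                  # 87 rewriting rules
print(" ".join(generate(random.Random(0))))
print(gr("HE IS HELPED BY A BOY .".split()))       # accepted
print(gr("ROVER LIKES THE GIRL .".split()))        # rejected by Gr
```

With other rule probabilities the same language is generated, but the
relative frequencies of the sentence types will differ.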

Consider now the four generators of the form $11 \to$ NH,12.
The words in NH are certainly equivalent to each other. Similarly
the four generators of type $11 \to$ NA,12 use the words in NA, which
are equivalent to each other. All these eight generators go from
state 11 to 12 and one may be tempted to believe that the elements
in NA are equivalent to the elements in NH. This is not the case
however, since Lemma 2.1 tells us that in order that this hold we
must have, for example, for the generators $2 \to$ NH,6 generators of
the form $2 \to$ NA,6. The latter ones do not appear in Gr, so that
the equivalence between NA and NH does not hold, which will
introduce an essential difficulty which will be studied in the
next section.

Table 2.2

generator number    generator
1                   1 → PRN,8
2,3                 1 → PNA,7
4,5,6               1 → DET,2
7,8                 1 → PRH,6
9,10                1 → PNH,6
11,12,13,14         2 → AJH,3
15,16               2 → AJA,4
17,18,19            2 → AJ1N,5
20,21,22,23         2 → NH,6
24,25,26,27         2 → NA,7
28,29,30            2 → NN,8
31,32,33,34         3 → NH,6
35,36,37,38         4 → NA,7
39,40,41            5 → NN,8
42,43               6 → VT,9
44,45               6 → VI,12
46,47               6 → AUX,13
48,49               6 → ADV,17
50,51               6 → VC,18
52,53               7 → AUX,13
54,55               8 → AUX,10
56,57,58            9 → DET,11
59,60,61            10 → AJ2N,12
62                  10 → NOT,16
63,64,65,66         11 → NH,12
67,68,69,70         11 → NA,12
71,72               12 → CONJ,1
73                  12 → .,F
74,75,76            13 → VP,14
77                  13 → NOT,15
78                  14 → BY,9
79,80,81            15 → VP,14
82,83,84            16 → AJ2N,12
85,86               17 → VT,9
87                  18 → REL,1

7.3. Word class partitioning.

A crucial part of the abduction algorithm tries to find the partition
of $V_T$ into classes of equivalent words. If this can be achieved we
have reduced the "combinatorial size" of the task considerably since
we can then operate on the level of pre-terminals rather than
terminals. In the test grammar the number is then reduced from 52 to
23. In natural language it is likely that the reduction would be even
greater.

But this is not the only reason why we shall pay so much
attention to it. We shall see later that the partitioning problem
presents the main mathematical difficulty: once it has been solved
the full abduction problem for finite state grammars can be
handled by a similar construction. Therefore we shall examine
this sub-problem in considerable detail.

Let us look at what is a real obstacle preventing us from
using one of the standard algorithms. Consider two words
$x, y \in V_T$ and ask whether they are equivalent or not. For this
purpose we assume that some test procedure has been arrived at that
will be applied to a given sentence $I$ generated according to the
syntax-controlled probability model.

As described in the previous section the sentence $I$ will be
deformed by $\mathcal{D}$ into $I^{\mathcal{D}}$ by replacing one or
several occurrences of $x$ in $I$ by $y$. If $x$ does not occur in $I$
no change is made. We will have to describe exactly how this
replacement is done when we analyze $\mathcal{D}$ mathematically, but
for the moment it will be enough to point out that the algorithm could
not possibly be completely correct so that we have to introduce the
probabilities of an error,

(3.1)  $\varepsilon$ = the probability that the test says yes although $x \not\equiv y$
       $\delta$ = the probability that the test says no although $x \equiv y$.

It is clear that $\varepsilon > 0$ since it can happen with positive
probability that $x \not\equiv y$ but that $x$ can be substituted for
$y$ in some sentences without destroying their grammaticality, see the
last section. The second probability, $\delta$, will be zero though,
since if $x \equiv y$ then any substitution $x \to y$ will leave the
sentence grammatical. However, if we allow for an imperfect teacher,
or, what is the same thing, that the learner lives in an imperfect
linguistic environment, then we should also allow $\delta$ to be
positive.

To calculate $\varepsilon$ we must specify the testing algorithm
precisely and we present one instance of such a calculation.
For variations on the testing algorithm the expression will not
hold exactly although it may indicate the order of magnitude of
$\varepsilon$. Say that we pick the first occurrence of $x$ in $I$, if
any, and replace it by $y$. To find $\varepsilon_{xy}$ for $x$ and $y$
fixed, we can reason as follows.
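
Before carrying out the calculation it may help to see the test at
work. The sketch below replaces the first occurrence of a word by
another and estimates the conditional probability that the deformed
sentence stays grammatical by simple Monte Carlo sampling. It is an
illustration only: the grammar is our encoding of Tables 2.1 and 2.2,
the rule probabilities are assumed uniform over the rules leaving each
state (the actual probabilities are not listed here), and the names
`deform` and `estimate_epsilon` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

CLASSES = {
    "DET": "A THE SOME", "AJH": "TALL CLEVER SHORT YOUNG",
    "AJA": "SPOTTED FRISKY", "AJ1N": "FINE NEW VALUABLE",
    "AJ2N": "BLUE ORANGE GREEN", "NH": "MAN BOY WOMAN GIRL",
    "NA": "CAT KITTEN DOG PUPPY", "NN": "TABLE CHAIR DESK",
    "AUX": "IS WAS", "VP": "SEEN HURT HELPED", "VT": "LIKES DISLIKES",
    "VI": "SPEAKS SINGS", "CONJ": "AND WHILE", "PRH": "HE SHE",
    "PRN": "IT", "PNH": "MARY JOHN", "PNA": "TOUKA ROVER",
    "ADV": "IMMENSELY VIOLENTLY", "VC": "SAYS CLAIMS", "REL": "THAT",
    "BY": "BY", "NOT": "NOT", "DOT": ".",
}
F = 19  # final state
TABLE_2_2 = """1 PRN 8, 1 PNA 7, 1 DET 2, 1 PRH 6, 1 PNH 6, 2 AJH 3, 2 AJA 4,
2 AJ1N 5, 2 NH 6, 2 NA 7, 2 NN 8, 3 NH 6, 4 NA 7, 5 NN 8, 6 VT 9, 6 VI 12,
6 AUX 13, 6 ADV 17, 6 VC 18, 7 AUX 13, 8 AUX 10, 9 DET 11, 10 AJ2N 12,
10 NOT 16, 11 NH 12, 11 NA 12, 12 CONJ 1, 12 DOT 19, 13 VP 14, 13 NOT 15,
14 BY 9, 15 VP 14, 16 AJ2N 12, 17 VT 9, 18 REL 1"""
RULES: List[Tuple[int, str, int]] = []
for entry in TABLE_2_2.split(","):
    i, cls, j = entry.split()
    RULES += [(int(i), x, int(j)) for x in CLASSES[cls].split()]


def generate(rng: random.Random) -> List[str]:
    """Sample a sentence; uniform choice among applicable rules (assumed)."""
    state, sentence = 1, []
    while state != F:
        word, state = rng.choice([(x, j) for i, x, j in RULES if i == state])
        sentence.append(word)
    return sentence


def gr(sentence: List[str]) -> bool:
    """The teacher: True iff the words lead from state 1 to F."""
    states = {1}
    for word in sentence:
        states = {j for i, x, j in RULES if i in states and x == word}
    return F in states


def deform(sentence: List[str], x: str, y: str) -> Optional[List[str]]:
    """Replace the first occurrence of x by y; None if x does not occur."""
    if x not in sentence:
        return None
    k = sentence.index(x)
    return sentence[:k] + [y] + sentence[k + 1:]


def estimate_epsilon(x: str, y: str, trials: int, rng: random.Random) -> float:
    """Fraction of sentences containing x whose deformation stays grammatical."""
    hits = yes = 0
    while hits < trials:
        deformed = deform(generate(rng), x, y)
        if deformed is None:
            continue          # x did not occur; the sentence is not used
        hits += 1
        yes += gr(deformed)
    return yes / hits


print(estimate_epsilon("MAN", "BOY", 50, random.Random(0)))   # 1.0: same class
print(estimate_epsilon("MAN", "CAT", 200, random.Random(0)))
```

For two words of the same class the teacher always answers yes. For MAN
and CAT the estimate lies strictly between 0 and 1, since the
substitution is harmless after state 11 but not, for example, after an
adjective of class AJH.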

The probability of generating a sentence with $x$ in the first
position and of length $L$ is

(3.2)  $p_{1 i_1}(x)\, p_{i_1 i_2}(x_2) \cdots p_{i_{L-2} i_{L-1}}(x_{L-1})\, r_{i_{L-1}}(x_L)$

summed over $x_2, x_3, \ldots, x_L \in V_T$ and over the $i$'s, and
interpreted as $r_1(x)$ if $L = 1$. Similarly, the probability of
getting a string with the first occurrence (if any) of $x$ in the
second position is

(3.3)  $p_{1 i_1}(x_1)\, p_{i_1 i_2}(x)\, p_{i_2 i_3}(x_3) \cdots p_{i_{L-2} i_{L-1}}(x_{L-1})\, r_{i_{L-1}}(x_L)$

summed over $x_1 \in V_T - x$, $x_3, \ldots, x_L \in V_T$ and over the
$i$'s, and we can do this for any position up to $L$.

Define

(3.4)  $t_{ij}(x) = \begin{cases} 1 & \text{if there is a rewriting rule } i \to xj \\ 0 & \text{else} \end{cases}$

       $\tau_i(x) = \begin{cases} 1 & \text{if there is a rewriting rule } i \to x \text{ leading to } F \\ 0 & \text{else} \end{cases}$

In order that the deformed image $I^{\mathcal{D}} \in \mathcal{T}$ we
must have, when the first occurrence of $x$ is in the $k$th position,

(3.5)  $1 = t_{1 j_1}(x_1)\, t_{j_1 j_2}(x_2) \cdots t_{j_{k-1} j_k}(y)\, t_{j_k j_{k+1}}(x_{k+1}) \cdots \tau_{j_{L-1}}(x_L)$

for some $j$-sequence. The event that $I$ contains an $x$ and
$I^{\mathcal{D}} \in \mathcal{T}$ will then have a probability $p_{xy}$
that can be obtained from the expressions in (3.2)-(3.3) by including
the right hand side of (3.5) as a factor in the terms and summing as
indicated but also over $L = 1, 2, 3, \ldots$. If we do this we get


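$$\begin{aligned}
p_{xy} = {} & r_1(x)\,\tau_1(y) + \sum_{L=2}^{\infty} \sum p_{1 i_1}(x)\, t_{1 j_1}(y)\, p_{i_1 i_2}(x_2)\, t_{j_1 j_2}(x_2) \cdots r_{i_{L-1}}(x_L)\, \tau_{j_{L-1}}(x_L) \\
& + \sum p_{1 i_1}(x_1)\, r_{i_1}(x)\, \tau_{i_1}(y) + \sum_{L=3}^{\infty} \sum p_{1 i_1}(x_1)\, p_{i_1 i_2}(x)\, t_{i_1 j_2}(y) \cdots r_{i_{L-1}}(x_L)\, \tau_{j_{L-1}}(x_L) \\
& + \cdots
\end{aligned} \tag{3.6}$$

where the second summation signs in each term indicate summation
over $i$'s and $j$'s and over $x$'s, but only for $x_1 \neq x$ in the
second sum, $x_1$ and $x_2 \neq x$ in the third sum, and so on. The
expression can be written more conveniently by introducing the arrays

$$\begin{aligned}
P_x &= \sum_{\xi \neq x} P(\xi) = P - P(x) \\
M &= \Big( \sum_{\xi} p_{\alpha\gamma}(\xi)\, t_{\beta\delta}(\xi) \Big) \\
N &= \Big( \sum_{\xi} r_{\alpha}(\xi)\, \tau_{\beta}(\xi) \Big), \qquad n = \big( r_{\alpha}(x)\, \tau_{\alpha}(y) \big) \\
S(x,y) &= \big( p_{\alpha\gamma}(x)\, t_{\alpha\delta}(y) \big), \qquad d = \mathrm{col}(1, 0, \ldots, 0)
\end{aligned} \tag{3.7}$$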


We can then write (3.6) as

$$\begin{aligned}
p_{xy} = {} & d^T n + \sum_{L=2}^{\infty} d^T S(x,y) M^{L-2} N + d^T P_x n + \sum_{L=3}^{\infty} d^T P_x S(x,y) M^{L-3} N \\
& + d^T P_x^2 n + \sum_{L=4}^{\infty} d^T P_x^2 S(x,y) M^{L-4} N + \cdots
\end{aligned} \tag{3.8}$$

This gives us

$$\begin{aligned}
p_{xy} &= d^T n + d^T S(x,y)(I-M)^{-1} N + d^T P_x n + d^T P_x S(x,y)(I-M)^{-1} N \\
&\quad + d^T P_x^2 n + d^T P_x^2 S(x,y)(I-M)^{-1} N + \cdots \\
&= d^T (I + P_x + P_x^2 + \cdots)\, n + d^T [I + P_x + P_x^2 + \cdots]\, S(x,y)(I-M)^{-1} N \\
&= d^T (I-P_x)^{-1} [\, n + S(x,y)(I-M)^{-1} N \,].
\end{aligned} \tag{3.9}$$

Note that $S(x,y)$ and $M$ are linear operators represented by
3- and 4-dimensional arrays respectively, and the second identity
operator $I$ stands for the identity as indexed by four subscripts,
$I = \{\delta_{\alpha\beta}\,\delta_{\gamma\delta}\}$.

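On the other hand the probability that a sentence from the
syntax-controlled probability model will not contain any instance
of the word $x$ is

$$d^T r_x + \sum_{L=2}^{\infty} \sum p_{1 i_1}(x_1)\, p_{i_1 i_2}(x_2) \cdots p_{i_{L-2} i_{L-1}}(x_{L-1})\, r_{i_{L-1}}(x_L) \tag{3.10}$$

summed over $x_1, \ldots, x_L \neq x$, or using (3.7)

$$d^T r_x + \sum_{L=2}^{\infty} d^T P_x^{L-1} r_x = d^T (I-P_x)^{-1} r_x \tag{3.11}$$

with $r_x = r - r(x)$.

The $\epsilon_{xy}$ is the conditional probability that
$x \not\equiv y$ is not detected if $x$ occurs in the sentence and is
replaced by $y$. Hence

$$\epsilon_{xy} = \frac{p_{xy}}{1 - d^T (I-P_x)^{-1} r_x}. \tag{3.12}$$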

The appearance of the factor $(I-M)^{-1}$ in (3.9) would seem to
make this expression difficult to compute directly. It is possible,
however, to give a direct and helpful interpretation of the matrix
$Q = (I-M)^{-1} N$. Consider Figure 3.1, which represents the
generation of the two strings $uxv$ and $uyv$ under consideration.
Returning to (3.6) it is not difficult to see that the entry
$q_{\gamma\delta}$ of $Q$ means the probability of generating a string
starting in state $\gamma$ and ending in $F$ such that the same
terminal string also leads from state $\delta$ to $F$. $Q$ need not
be symmetric.

Figure 3.1

The $q$'s must satisfy the recursion

$$q_{ij} = N_{ij} + \sum_{\alpha,\beta} M_{ij\alpha\beta}\, q_{\alpha\beta} \tag{3.13}$$

which can also be seen, of course, from the algebraic definition
of $Q$. What is more important, however, is to realize that $q_{ij}$
must be zero if there is no word $\xi$ that exits from both $i$ and
$j$. In other words $q_{ij}$ can be positive only if there are
rewriting rules $i \xrightarrow{\xi} k$, $j \xrightarrow{\xi} l$.
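The recursion (3.13) can be illustrated numerically. The following is only a sketch under stated assumptions: the three-state grammar `RULES`, its probabilities, and the helper names `pair_arrays` and `solve_q` are hypothetical and do not come from the text, and the grammar is assumed deterministic (at most one rule per state and word), so that summing over $j$-paths is the same as asking whether one exists. The arrays $N$ and $M$ are built in the spirit of (3.7), and $Q$ is obtained by iterating $q \leftarrow N + Mq$ rather than by inverting $I - M$.

```python
# Hypothetical toy grammar (not from the text). RULES[state] lists
# (word, next_state, probability); next_state "F" marks a rule i -x-> F.
RULES = {
    1: [("a", 2, 0.5), ("b", 3, 0.5)],
    2: [("c", "F", 0.7), ("d", 2, 0.3)],
    3: [("c", "F", 0.4), ("d", 3, 0.2), ("e", "F", 0.4)],
}


def pair_arrays(rules):
    """Return N[(i, j)] and M[(i, j)][(a, b)] in the spirit of (3.7)."""
    N, M = {}, {}
    for i in rules:
        for j in rules:
            N[(i, j)] = 0.0
            M[(i, j)] = {}
            for word, next_i, prob in rules[i]:
                for word_j, next_j, _ in rules[j]:
                    if word_j != word:
                        continue  # the same word must exit from both states
                    if next_i == "F" and next_j == "F":
                        N[(i, j)] += prob  # r_i(word) * tau_j(word)
                    elif next_i != "F" and next_j != "F":
                        key = (next_i, next_j)  # p_{i,a}(word) * t_{j,b}(word)
                        M[(i, j)][key] = M[(i, j)].get(key, 0.0) + prob
    return N, M


def solve_q(rules, sweeps=500):
    """Iterate q <- N + M q, the recursion (3.13), starting from q = N."""
    N, M = pair_arrays(rules)
    q = dict(N)
    for _ in range(sweeps):
        q = {ij: N[ij] + sum(m * q[ab] for ab, m in M[ij].items()) for ij in N}
    return q


Q = solve_q(RULES)
print(round(Q[(2, 3)], 6), round(Q[(3, 2)], 6), round(Q[(1, 1)], 6))
```

In this toy case every string generated from state 2 is also accepted from state 3, while only half of the probability mass generated from state 3 is accepted from state 2, so the computed $Q$ is not symmetric. The entry for the pair (1, 2) stays exactly zero because no word exits from both states.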

This implies that $Q$ will be an extremely sparse matrix where
we can a priori fill in many zeros just by looking at the
diagram. Further, for a main diagonal element $q_{ii}$, this value
means just the probability of generating a string leading from
state $i$ to $F$. According to the assumptions that we have adopted
this probability will automatically be equal to one.

Introduce a relation $E$ between states $i$ and $j$, with $iEj$ if
there exist $\xi \in V_T$ and $k$, $l$ such that
$i \xrightarrow{\xi} k$, $j \xrightarrow{\xi} l$.
Let $E^*$ be the transitive closure of $E$ so that $E^*$ is an
equivalence relation. Then $E^*$ partitions the set of states into
classes and the $Q$-matrix can be partitioned into block structure,
with the row and column subscripts of the blocks corresponding
to the classes induced by $E^*$. The computational problem associated
with (3.12) is therefore manageable.
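The classes induced by $E^*$ can be found with a standard union-find pass over the rewriting rules. This is a minimal sketch, not the author's procedure: the grammar `RULES` and the function name `e_star_classes` are hypothetical, and rules ending in $F$ are assumed to count as exits just like rules leading to another state.

```python
# Hypothetical toy grammar (not from the text). RULES[state] lists
# (word, next_state, probability); next_state "F" marks a rule i -x-> F.
RULES = {
    1: [("a", 2, 0.5), ("b", 3, 0.5)],
    2: [("c", "F", 0.7), ("d", 2, 0.3)],
    3: [("c", "F", 0.4), ("d", 3, 0.2), ("e", "F", 0.4)],
}


def e_star_classes(rules):
    """Partition the states by the transitive closure of 'shares an exit word'."""
    parent = {state: state for state in rules}

    def find(state):
        while parent[state] != state:
            parent[state] = parent[parent[state]]  # path halving
            state = parent[state]
        return state

    seen = {}  # word -> some state that the word exits from
    for state, outgoing in rules.items():
        for word, _next_state, _prob in outgoing:
            if word in seen:
                parent[find(state)] = find(seen[word])  # i E j: merge classes
            else:
                seen[word] = state
    classes = {}
    for state in rules:
        classes.setdefault(find(state), []).append(state)
    return sorted(sorted(members) for members in classes.values())


print(e_star_classes(RULES))
```

Here states 2 and 3 share the words c and d and fall into one class, while state 1 is alone, so $Q$ would split into a 1 x 1 block and a 2 x 2 block.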

Combining (3.11) with (3.9) we get the probability $\epsilon_{xy}$
that the deformation $d(x \to y)$ does not detect that
$x \not\equiv y$ as

$$\epsilon_{xy} = d^T (I-P_x)^{-1} [\, r_x + n + S(x,y)\, Q \,]. \tag{3.14}$$

Repeating this deformation on successive sentences we can therefore
expect to have to repeat this a number of times, with the number
of the order $1/(1-\epsilon_{xy})$. For large values of
$\epsilon_{xy}$ considerable testing will be needed but we shall try
to reduce this by improving the form of the partitioning algorithms.
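As a cross-check on the matrix expressions, the same probabilities can be obtained by brute-force enumeration on a small example. The sketch below makes several assumptions that are not in the text: the deterministic toy grammar `RULES`, the truncation length `max_len`, and the function names are all hypothetical. It returns both the conditional error of (3.12) and the unconditional probability of (3.14), from which the order of magnitude $1/(1-\epsilon_{xy})$ of the number of sentences needed follows.

```python
# Hypothetical deterministic toy grammar (not from the text). State 1 is the
# start state; RULES[state] lists (word, next_state, probability), with
# next_state "F" for a terminating rule.
RULES = {
    1: [("a", 2, 0.5), ("b", 3, 0.5)],
    2: [("c", "F", 0.7), ("d", 2, 0.3)],
    3: [("c", "F", 0.4), ("d", 3, 0.2), ("e", "F", 0.4)],
}


def sentences(rules, max_len):
    """Yield (sentence, probability) for every sentence of length <= max_len."""
    stack = [(1, (), 1.0)]
    while stack:
        state, prefix, prob = stack.pop()
        for word, nxt, p in rules[state]:
            sent = prefix + (word,)
            if nxt == "F":
                yield sent, prob * p
            elif len(sent) < max_len:
                stack.append((nxt, sent, prob * p))


def grammatical(rules, sentence):
    """Run the deterministic automaton; True if the string ends exactly in F."""
    state = 1
    for word in sentence:
        if state == "F":
            return False
        targets = [nxt for w, nxt, _ in rules[state] if w == word]
        if not targets:
            return False
        state = targets[0]
    return state == "F"


def error_probabilities(rules, x, y, max_len=60):
    """Return (conditional error as in (3.12), unconditional as in (3.14))."""
    p_no_x = p_xy = 0.0
    for sent, prob in sentences(rules, max_len):
        if x not in sent:
            p_no_x += prob
            continue
        k = sent.index(x)  # first occurrence of x
        deformed = sent[:k] + (y,) + sent[k + 1:]
        if grammatical(rules, deformed):
            p_xy += prob
    return p_xy / (1.0 - p_no_x), p_no_x + p_xy


cond_ba, uncond_ba = error_probabilities(RULES, "b", "a")
cond_ab, _ = error_probabilities(RULES, "a", "b")
print(round(cond_ba, 6), round(uncond_ba, 6), round(1.0 / (1.0 - uncond_ba), 3))
```

In this example replacing b by a goes undetected half of the time when b occurs, so about four sentences are needed on average, whereas replacing a by b is never detected. The error probability is therefore not symmetric in $x$ and $y$.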

It was observed empirically, when the described algorithm was
simulated, that the selection of $x$ from $I$ could be done better.
Mechanical criteria such as selecting the first occurrence of $x$,
where $x$ itself is picked at random from $V_T$, are wasteful. The
same holds for selecting one at random of the (possible) occurrences
of $x$ in $I$. Instead we should select $x$ and $y$ in such a way
that we test critical choices, where we have reason to really suspect
that $x \not\equiv y$, and not waste our effort by testing $x$ and
$y$ often if we already believe that $x \equiv y$. This idea was
implemented as follows.

Partitioning Algorithm: Step 1. Initialize by creating a single
class consisting of all of $V_T$ and choose one word as the prototype
of the class.

Step 2. The algorithm LISTEN produces a sentence $I$ from
$L(Gr)$ according to the (unknown) syntax-controlled probability
model.

Step 3. The algorithm ATTENTION (see below) selects an
$x \in I$.

Step 4. The algorithm SPEAK produces the deformed string
$I^d = d(x \to y)I$ with $y$ equal to the prototype of the class $x$
belongs to currently. The grammaticality of $I^d$ is obtained.
Go to Step 6 if $gr(I^d)$ = FALSE.

Step 5. If $gr(I^d)$ = TRUE the algorithm STRENGTHEN increases
the plausibility of our belief that $x \equiv y$ by moving $x$ closer
to $y$ in this class if possible. In other words the positions of $x$
and the element next closer to $y$ are permuted unless the next
element happens to be $y$. Go to Step 2 (or stop).

Step 6. Move $x$ to the end of the next class if there is
one and go to Step 4; else go to Step 7.

Step 7. Create a new class with $x$ as its single element
and prototype.

The lists representing the provisional classes can be
visualized as in Figure 3.2, where typical movements of words have
been indicated.

The algorithm ATTENTION is intended to avoid wasteful testing
and concentrate the abduction process on hypotheses that seem
uncertain at the moment. It considers the set of words in the
current class, and it is convenient to think of them as a list
ordered from left to right, with the prototype at the far left. For
each word measure its distance from the leftmost element, the
prototype. ATTENTION selects one word at random, not with uniform
distribution but with probabilities monotonically increasing with
the distance.

The algorithm STRENGTHEN updates our current belief that
$x \equiv y$ by moving $x$ leftwards into a position associated with
a higher plausibility of being equivalent to the prototype.
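The steps of the partitioning algorithm, together with ATTENTION and STRENGTHEN, can be sketched in a few lines. This is an illustration under assumptions, not the implementation that was run for the text: the three-slot toy language `TRUE_CLASSES`, the choice of selection weights equal to the distance from the prototype, the error-free oracle `gr`, and all function names are hypothetical.

```python
import random

# Hypothetical toy language (not the test grammar of the text): every
# sentence is determiner + noun + verb, so the true classes are known.
TRUE_CLASSES = [["the", "a"], ["dog", "cat"], ["runs", "sleeps"]]


def listen(rng):
    """LISTEN: produce one random grammatical sentence."""
    return [rng.choice(cls) for cls in TRUE_CLASSES]


def gr(sentence):
    """Grammaticality oracle, assumed error-free here (delta = 0)."""
    return len(sentence) == len(TRUE_CLASSES) and all(
        word in cls for word, cls in zip(sentence, TRUE_CLASSES))


def locate(classes, word):
    """Return (class index, distance from the prototype) of a word."""
    for index, cls in enumerate(classes):
        if word in cls:
            return index, cls.index(word)
    raise KeyError(word)


def attention(classes, sentence, rng):
    """ATTENTION: pick a word with probability proportional to its distance."""
    words = sorted(set(sentence))
    weights = [locate(classes, w)[1] for w in words]
    if sum(weights) == 0:
        return None  # only prototypes were heard; nothing to test
    return rng.choices(words, weights=weights, k=1)[0]


def speak(sentence, x, y):
    """SPEAK: replace the first occurrence of x by y."""
    deformed = list(sentence)
    deformed[deformed.index(x)] = y
    return deformed


def partition(vocabulary, n_sentences, seed=0):
    rng = random.Random(seed)
    classes = [list(vocabulary)]                      # Step 1
    for _ in range(n_sentences):
        sentence = listen(rng)                        # Step 2
        x = attention(classes, sentence, rng)         # Step 3
        if x is None:
            continue
        ci, _ = locate(classes, x)
        while True:
            cls = classes[ci]
            if gr(speak(sentence, x, cls[0])):        # Step 4
                pos = cls.index(x)                    # Step 5: STRENGTHEN
                if pos > 1:
                    cls[pos - 1], cls[pos] = cls[pos], cls[pos - 1]
                break
            cls.remove(x)                             # Step 6
            ci += 1
            if ci == len(classes):
                classes.append([x])                   # Step 7
                break
            classes[ci].append(x)
    return classes


VOCABULARY = ["the", "a", "dog", "cat", "runs", "sleeps"]
result = partition(VOCABULARY, n_sentences=400)
print(result)
```

Taking the weight equal to the distance means a prototype is never selected, which avoids the empty test of a word against itself. Any other weighting that increases with the distance would fit the description equally well.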

Theorem 3.1. The algorithm produces partitions that converge
with probability one in a finite number of steps to the true
partition.

Proof: Consider the sequence of sentences produced by LISTEN,
$I(1), I(2), \ldots, I(t), \ldots$, and denote by $c_t$ the number of
classes established after $t$ sentences have been heard. Since $c_t$
is non-decreasing in $t$ and bounded by the true number of classes,
it converges to some limiting random variable $c_\infty$. If
$c_\infty$ is less than the true number of classes there exists at
least one word, say $x$, not equivalent to any of the $c_\infty$
prototypes. Let $E$ be the event, given the $c_\infty$ prototypes,
that a sentence is produced that contains $x$, that $x$ is selected
and tested against each prototype and that these tests fail. Then
$P(E)$ is positive, although perhaps quite small. If $E$ occurs then
a new class will be established by Step 7 in the algorithm. Hence
this will happen with probability one for the infinite sequence of
sentences produced so that $c_\infty$ must be equal to the true
number of classes. The way the prototypes have been selected they
are all mutually non-equivalent. Therefore, in the limit and with
probability one, they represent each of the true equivalence classes.

Figure 3.2

The movements caused by the algorithm in Figure 3.2 are
partly within classes and partly between classes. The first type
of movement does not influence the partition (directly). The
second type moves a word downwards to another class or to a new
class. This implies that once the true number of classes has been
reached the algorithm only moves words away from classes to which
they do not belong. A word $x$ will not stay for more than a finite
number of steps in the wrong class. Hence the partitions converge
after a finite number of steps to the true partition.

The theorem guarantees that this pattern processor is
consistent but it does not say anything about the speed of the
convergence. To learn about this the algorithm has been implemented,
actually in several quite different versions, and executed on the
computer.

For the test grammar of section 7.2 words are classified
rapidly into equivalence classes during the early part of execution.
The learning rate then slows down considerably. For a typical
run, after 150-200 sentences have been heard and processed, most
of the 23 equivalence classes have been established, with a
couple of words misclassified out of the 52.

Graphically this looks like Figure 3.3, showing the number
of words classified correctly with number of sentences as abscissas,
and Figure 3.4, which shows the number of discovered word classes.

In some respects this abduction algorithm satisfies the
requirements of section 7.1. Whether it is "natural" or not may
be answered differently by different persons but at least it
appears more natural than some alternative ones. It is fairly
fast, although we do not claim any optimality. It is insensitive
to the $\epsilon$-error which plays a fundamental role in
plausibility inference.

If we change the pattern structure, however, the algorithm
appears less attractive. More precisely, if the answer to the
learner as to the value of $gr(I^d)$ is not always correct, so that
the learner will be told that $I^d \notin L(Gr)$ although
$I^d \in L(Gr)$, we have $\delta > 0$, see last section.

Scrutinizing the algorithm it can then be seen that Step 7
may be taken when $x$ is actually equivalent to the prototype of
one of the provisional equivalence classes. Hence it can happen
with positive probability that too many provisional equivalence
classes are set up. Since the algorithm has no step involving
coalescence of equivalence classes this mistake will never be
corrected.
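The size of this effect can be made concrete with a short calculation. It rests on an assumption that is not made in the text: each test of $x$ against the prototype of its true class is taken to return FALSE wrongly with probability $\delta$, independently of the other tests. Since words only move downwards, a single false rejection removes $x$ from its true class for good, so after $m$ such tests

$$P(x \text{ is still in its true class}) = (1-\delta)^m \longrightarrow 0 \quad (m \to \infty), \qquad E[\text{number of tests until the first false rejection}] = \frac{1}{\delta}.$$

For instance, with $\delta = 0.01$ and $m = 300$ tests one finds $(0.99)^{300} \approx 0.049$, so under this assumption even a one percent error rate would almost surely misplace a frequently tested word.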

Therefore the algorithm is not robust to small ($\delta$ small)
changes in the pattern structure of the linguistic environment of
the learner. The question arises how to compensate for this.

One possibility is to include a coalescing step into the
algorithm, to be executed only occasionally. Although it may be
possible to do this it will not be attempted since it would
destroy the elegant simplicity of the algorithm. Instead a very
different sort of algorithm will be examined.

Notes

The approach in this section was strongly influenced by
ideas due to the late A. Špaček on deduction-induction under
imperfect conditions. An example of such ideas can be found in
Špaček (1960). Unfortunately Špaček was not given the opportunity
to complete his innovative thinking. His published work on this
topic deserves to be better known.

Pattern inference in general, including pattern abduction,
can be viewed as inductive behavior, just as J. Neyman (1950, 1966)
suggested that statistical inference can be described by this
term. See also section 1.1 in the current volume.

There  is  also  some  relation  to  work  in  artificial  intelligence: 
mechanization  of  proofs,  heuristic  programs,  etc.  The  reader  could 
consult  e.g.  Hunt  (1975),  see  Chapters  IX  to  XII  in  particular, 
and  the  bibliography. 

A particularly interesting attempt to formalize the induction
process is due to R. Solomonoff, see Solomonoff (1964a,b).

7.2. Grammatical inference in general is what grammarians have
been doing for thousands of years. Formal grammatical inference
is of more recent origin but even so the literature is already
voluminous. Most of it is only marginally related to the
abduction study in this section. The interested reader may be
referred to Hunt (1975), Chapter VII, Fu (    ), Chapter    ,
where many more references can be found. See also Patel (1972),
Maryanski (1974).

More  relevant  to  this  section  is  the  early  discussion  in 
Miller-Chomsky  (1957)  which  suggests  substitution  as  a natural 
principle  on  which  to  base  the  algorithm.  Another  important 
reference  is  Solomonoff  ( ) which  is  based  on  the 

theorem. It is not known to the author if these early attempts
were followed up by detailed analysis of robustness, error
sensitivity, etc.

The  material  in  sections  7.2-  is  based  on  a study  begun 

in  197  as  a result  of  a discussion  between  the  author  and 
L.  Cooper,  W.  Freiberger,  and  H.  Kucera.  Some  preliminary  results 
were  reported  in  Grenander  ( ) but  the  algorithms  were  not 

analyzed  in  sufficient  detail.  A more  complete  analysis  was 
presented  in  Shrier  (1977  ) together  with  mathematical  experiments 
illustrating  the  strengths  and  weaknesses  of  one  particular 
abduction  scheme. 


The choice of finite state languages is very restrictive.
It should be remarked here that our goal is not to study abduction
of natural language, but to see how abduction can be organized in
a concrete setting and what mathematical difficulties one will
then encounter, e.g. the determination of $\epsilon$ and $\delta$,
robustness, convergence.

When choosing similarity transformations one has to decide
what are the relevant properties that we want to concentrate on,
what sort of "sameness" we are interested in at the moment.
Often we will have to operate with more than one $S$. In the
present section we want to examine the validity of the bonds, which
leads to the definition used.


